iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning
The intelligibility of natural speech is seriously degraded when exposed to
adverse noisy environments. In this work, we propose a deep learning-based
speech modification method to compensate for the intelligibility loss, with the
constraint that the root mean square (RMS) level and duration of the speech
signal are maintained before and after modifications. Specifically, we utilize
an iMetricGAN approach to optimize the speech intelligibility metrics with
generative adversarial networks (GANs). Experimental results show that the
proposed iMetricGAN outperforms conventional state-of-the-art algorithms in
terms of objective measures, i.e., speech intelligibility in bits (SIIB) and
extended short-time objective intelligibility (ESTOI), under a Cafeteria noise
condition. In addition, formal listening tests reveal significant
intelligibility gains when both noise and reverberation exist.
Comment: 5 pages, Submitted to INTERSPEECH 202
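The RMS constraint in the iMetricGAN abstract above can be illustrated with a small sketch. This is not the authors' implementation; it only shows, under the assumption that signals are plain lists of samples, how a modified signal could be rescaled so that its RMS level matches the original. The names `rms` and `match_rms` are hypothetical, and duration is preserved here simply because the sample count is unchanged.

```python
import math


def rms(signal):
    """Root mean square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def match_rms(modified, reference):
    """Rescale `modified` so that its RMS level equals that of `reference`.

    Hypothetical helper for illustration only.
    """
    current = rms(modified)
    if current == 0.0:
        return list(modified)  # silent signal: nothing to rescale
    gain = rms(reference) / current
    return [gain * x for x in modified]


# Toy data: `boosted` stands in for an enhanced signal of the same length.
original = [0.1, -0.2, 0.3, -0.4]
boosted = [0.5, -0.1, 0.9, -0.3]
normalized = match_rms(boosted, original)
```

After rescaling, `normalized` has the same number of samples and the same RMS level as `original`, so any intelligibility gain would have to come from how the energy is redistributed rather than from simply making the speech louder.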
Improving Meeting Inclusiveness using Speech Interruption Analysis
Meetings are a pervasive method of communication within all types of
companies and organizations, and using remote collaboration systems to conduct
meetings has increased dramatically since the COVID-19 pandemic. However, not
all meetings are inclusive, especially in terms of the participation rates
among attendees. In a recent large-scale survey conducted at Microsoft, the top
suggestion given by meeting participants for improving inclusiveness is to
improve the ability of remote participants to interrupt and acquire the floor
during meetings. We show that the use of the virtual raise hand (VRH) feature
can lead to an increase in predicted meeting inclusiveness at Microsoft. One
challenge is that VRH is used in fewer than 1% of all meetings. To
drive its adoption and thereby improve inclusiveness (and participation), we
present a machine learning-based system that predicts when a meeting
participant attempts to obtain the floor, but fails to interrupt (termed a
'failed interruption'). This prediction can be used to nudge the user to raise
their virtual hand within the meeting. We believe this is the first failed
speech interruption detector, and the performance on a realistic test set has
an area under curve (AUC) of 0.95 with a true positive rate (TPR) of 50% at a
false positive rate (FPR) of <1%. To our knowledge, this is also the first
dataset of interruption categories (including the failed interruption category)
for remote meetings. Finally, we believe this is the first such system designed
to improve meeting inclusiveness through speech interruption analysis and
active intervention.
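The operating point quoted in the abstract above (a TPR reported at a bounded FPR) can be made concrete with a short sketch. This is not the authors' evaluation code; the function `tpr_at_fpr` and the toy scores and labels are assumptions made for illustration. It sweeps the decision threshold from high to low and keeps the best true positive rate seen while the false positive rate stays within the limit.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest TPR reachable while FPR stays <= max_fpr.

    A sample is predicted positive when its score >= threshold.
    Hypothetical helper for illustration only.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_tpr = 0.0
    # Lowering the threshold can only add predictions, so FPR never decreases
    # and the sweep can stop at the first threshold that exceeds the limit.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives > max_fpr:
            break
        best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Toy detector outputs: 1 = failed interruption, 0 = anything else.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = tpr_at_fpr(scores, labels, max_fpr=0.0)    # 0.5
relaxed = tpr_at_fpr(scores, labels, max_fpr=0.25)  # 0.75
```

A tight FPR limit suits a nudging feature, since each false alarm prompts a user who was not trying to interrupt; the cost is that some genuine failed interruptions go undetected.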
Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
Without the need for a clean reference, non-intrusive speech assessment
methods have attracted great attention for objective evaluations. While deep
learning models have been used to develop non-intrusive speech assessment
methods with promising results, there is limited research on hearing-impaired
subjects. This study proposes a multi-objective non-intrusive hearing-aid
speech assessment model, called HASA-Net Large, which predicts speech quality
and intelligibility scores based on input speech signals and specified
hearing-loss patterns. Our experiments showed that using pre-trained
self-supervised learning (SSL) models leads to a significant boost in speech
quality and intelligibility predictions compared to using spectrograms as
input. Additionally, we examined three distinct fine-tuning approaches that
resulted in further performance improvements. Furthermore, we demonstrated
that incorporating SSL models resulted in greater transferability to an
out-of-distribution (OOD) dataset. Finally, this study introduces HASA-Net
Large, which is a non-intrusive approach for evaluating speech quality and
intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss
patterns to accurately predict speech quality and intelligibility levels for
individuals with normal and impaired hearing and demonstrates superior
prediction performance and transferability.